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The next generation of High Energy Physics experiments requires a GRID approach to a distributed computing 
system and the associated data management: the key concept is the Virtual Organisation(VO), a group of 
geographycally distributed users with a common goal and the will to share their resources. A similar approach is 
being applied to a group of Hospitals which joined the GPCALMA project (Grid Platform for Computer Assisted 
Library for MAmmography) , which will allow common screening programs for early diagnosis of breast and, in 
the future, lung cancer. HEP techniques come into play in writing the application code, which makes use of 
neural networks for the image analysis and shows performances similar to radiologists in the diagnosis. GRID 
technologies will allow remote image analysis and interactive online diagnosis, with a relevant reduction of the 
delays presently associated to screening programs. 



1. Introduction 

A reduction of breast cancer mortality in 
asymptomatic women is possible in case of early 
diagnosis 1 , which is available thanks to screen- 
ing programs, a periodical mammographic exami- 
nation performed for 49-69 years old women. The 
GPCALMA Collaboration aims at the develop- 
ment of tools that would help in the early diagno- 
sis of breast cancer: Computer Assisted Detection 
(CAD) would significantly improve the prospects 
for mammographic screening, by quickly provid- 
ing reliable information to the radiologists. 

A dedicated software to search for massive le- 
sions and microcalcification clusters was devel- 
oped recently (1998-2001): its best results in 
the search for massive lesions (microcalcification 
clusters) are 94% (92%) for sensitivity and 95% 
(92%) for specificity. Meanwhile, in view of 
the huge distributed computing effort required 
by the CERN/LHC collaborations, several GRID 
projects were started. It was soon understood 
that the application of GRID technologies to a 



database of mammographic images would facil- 
itate a large-scale screening program, providing 
transparent and real time access to the full data 
set. 

The data collection in a mammographic screen- 
ing program will intrinsically create a distributed 
database, involving several sites with different 
functionality: data collection sites and diagnos- 
tic sites, i.e. access points from where radiolo- 
gists would be able to query/analyze the whole 
distributed database. The scale is pretty similar 
to that of LHC projects: taking Italy as an ex- 
ample, a full mammographic screening program 
would act on a target sample of about 6.8 mil- 
lion women, thus generating 3.4 millions mam- 
mographic exams/year. With an average size of 
60 MB /exam, the amount of raw data would be 
in the order of 200 TB /year: a screening program 
on the European scale would be a data source 
comparable to one of the LHC experiments. 

GPCALMA was proposed in 2001, with the 
purpose of developing a GRID application-,- based 
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on technologies similar to those adopted by the 
CERN/ALICE Collaboration. In each hospital, 
digital images will be stored in the local database 
and registered to a common service (Data Cata- 
logue). Data describing the mammograms, also 
known as metadata, will also be stored in the 
Data Catalogue and could be used to define an 
input sample for any kind of epidemiology study. 
The algorithm for the image analysis will be sent 
to the remote site where images are stored, rather 
than moving them to the radiologist's sites. A 
preliminary selection of cancer candidates will be 
quickly performed and only mammograms with 
cancer probabilities higher than a given thresh- 
old would be transferred to the diagnostic sites 
and interactively analysed by one or more radiol- 
ogists. 

2. The GPCALMA CAD Station 

The hardware requirements for the GPCALMA 
CAD Station are very simple: a PC with SCSI 
bus connected to a planar scanner and to a 
high resolution monitor. The station can pro- 
cess mammograms directly acquired by the scan- 
ner and/or images from hie and allows human 
and/or automatic analysis of the digital mammo- 
gram. The software configuration for the use in 
local mode requires the installation of ROOT 2 
and GPCALMA, which can be downloaded ei- 
ther in the form of source code from the respec- 
tive CVS servers. The functionality is usually 
accessed through a Graphic User Interface, or, 
for developers, the ROOT interactive shell. The 
Graphic User Interface (fig. ^1 allows the acquisi- 
tion of new data, as well as the analysis of exist- 
ing ones. Three main menus drive the creation of 
(access to) datasets at the patient and the image 
level and the execution of CADe algorithms. The 
images are displayed according to the standard 
format required by radiologists: for each image, 
it is possible to insert or modify diagnosis and an- 
notations, manually select the Regions of Interest 
(ROI) corresponding to the radiologists geomet- 
rical indication. An interactive procedure allows 
zooming, either continously or on a selected re- 
gion, windowing, gray levels and contrast selec- 
tion, image inversion, luminosity tuning. The 
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Figure 1. The GPCALMA Graphic User Inter- 
face. Three menus, corresponding to the Patient, 
the Images and the CAD diagnosis levels, drive 
it. On the left, the CAD results for microcal- 
ciheations and masses are shown in red squares 
and green circles, together with the radiologist's 
diagnosis (blue circle). On the right, the image 
colours are inverted. The widget drives the up- 
date of patient and image related metadata. 



human analysis produces a diagnosis of the breast 
lesions in terms of kind, localization on the image, 
average dimensions and, if present, histological 
type. The automatic procedure finds the ROI's 
on the image with a probability of containing an 
interesting area larger than a pre-selected thresh- 
old value. 

3. Grid Approach 

The amount of data generated by a national 
or european screening program is so large that 
they can't be managed by a single computing cen- 
tre. In addition, data are generated according to 
an instrinsically distributed pattern: any hospital 
participating to the program will collect a small 
fraction of the data. Still, that amount would 
be large enough to saturate the available network 
connections. 

The availability of the whole database to a radi- 
ologist, regardless of the data distribution, would 
provide several advantages: 



3 



• the CAD algorithms could be trained on a 
much larger data sample, with an improve- 
ment on their performance, in terms of both 
sensitivity and specificity. 

• the CAD algorithms could be used as real 
time selectors of images with high breast 
cancer probability (see fig. 0): radiologists 
would be able to prioritise their work, with 
a remarkable reduction of the delay between 
the data acquisition and the human diagno- 
sis (it could be reduced to a few days). 

• data associated to the images (i.e., meta- 
data) and stored on the distributed system 
would be available to select the proper input 
for epidemiology studies or for the training 
of young radiologists. 

These advantages would be granted by a GRID 
approach: the configuration of a Virtual Organ- 
isation, with common services (Data and Meta- 
data Catalogue, Job Scheduler, Information Sys- 
tem) and a number of distributed nodes provid- 
ing computing and storage resources would al- 
low the implementation of the screening, tele- 
training and epidemiology use cases. However, 
with respect to the model applied to High En- 
ergy Physics, there are some important differ- 
ences: the network conditions do not allow the 
transfer of large amounts of data, the local nodes 
(hospitals) do not agree on the raw data transfer 
to other nodes as a standard and, most impor- 
tant, some of the use cases require interactivity. 

According to these restrictions, our approach to 
the implementation of the GPCALMA Grid ap- 
plication is based on two different tools: AliEn 3 
for the management of common services, PROOF 
for the interactive analysis of remote data 
without data transfer. 

3.1. Data Management 

The GPCALMA data model foresees several 
Data Collection Centres |5] , where mammograms 
are collected, locally stored and registered in the 
Data Catalogue. In order to make them avail- 
able to a radiologist connecting from a Diagnostic 
Centre, it is mandatory to use a mechanism that 
identifies the data corresponding to the exam in 
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Figure 2. The screening use case: Data Collection 
Centres store and register the images and the as- 
sociated metadata in the AliEn Data Catalogue. 
Radiologists, from Diagnosis Centres, start the 
CAD remotely, without raw data transfer, mak- 
ing use of PROOF. Only the images correspond- 
ing to cancer probability larger than the selected 
threshold are moved to the Diagnosis Centre for 
the real-time visual inspection. Eventually, the 
small fraction of undefined cases can be sent to 
other radiologists. 



a site-independent way: they must be selected by 
means of a set of requirements on the attached 
metadata and identified through a Logical Name 
which must be independent of their physical loca- 
tion. AliEn implements these features in its Data 
Catalogue Services, run by the Server: data are 
registered making use of a hierarchical namespace 
for their Logical Names and the system keeps 
track of their association to the actual name of the 
physical files. In addition, it is possible to attach 
metadata to each level of the hierarchical names- 
pace. The Data Catalogue is browsable from the 
AliEn command line as well as from the Web 
portal; the CH — h Application Program Interface 
(API) to ROOT is under development. Metadata 
associated to the images can be classified in sev- 
eral categories: patient and exam identification 
data, results of the CAD algorithm analysis, ra- 
diologist's diagnosis, histological diagnosis, etc.. 
Some of these data will be directly stored in the 
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Data Catalogue, but some of them may be stored 
in dedicated files and registered: the decision will 
be made after a discussion with the radiologists. 

A dedicated AliEn Server for GPCALMA has 
been configured A, in collaboration with the 
AliEn development team. Fig. [3] shows a screen- 
shot from the WEB Portal. 
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Figure 3. Screenshot from the GPCALMA AliEn 
WEB Portal. Making use of the left side frame, 
the site can be navigated. General Information 
about the AliEn project, the installation and con- 
figuration guides, the status of the Virtual Or- 
ganisation Services can be accessed. On the main 
frame, the list of the core services is shown, to- 
gether with their status. 

3.2. Remote Data Processing 

Tele-diagnosis and tele-training require interac- 
tivity in order to be fully exploited, while in the 
case of screening it would be possible - altough 
not optimal - to live without. The PROOF Par- 
allel ROOt Facility system [2J allows to run in- 
teractive parallel processes on a distributed clus- 
ter of computers. A dedicated cluster of several 
PCs was configured and the remote analysis of a 
digitised mammogram without data transfer was 
recently run. As soon as input selection from the 
AliEn Data Catalogue will be possible, more com- 
plex use cases will be deployed. The basic idea is 
that, whenever a list of input Logical Names will 
be selected, that will be split into a number of 



sub-lists containing all the files stored in a given 
site and each sub-list will be sent to the corre- 
sponding node, where the mammograms will be 
analysed. 

4. Present Status and Plans 

The project is developing according to the orig- 
inal schedule. The CAD algorithms were rewrit- 
ten in C++, making use of ROOT, in order to be 
PROOF-compliant; moreover, the ROOT func- 
tionality allowed a significant improvement of the 
Graphic User Interface, which, thanks to the pos- 
sibility to manipulate the image and the asso- 
ciated description data, is now considered fully 
satisfactory by the radiologists involved in the 
project. The GPCALMA application code is 
available via CVS server for download and in- 
stallation; a script to be used for the node con- 
figuration is being developed. The AliEn Server, 
which describes the Virtual Organisation and 
manages its services, is installed and configured; 
some AliEn Clients are in use, and they will soon 
be tested with GPCALMA jobs. The remote 
analysis of mammograms was successfully accom- 
plished making use of PROOF. Presently, all but 
one the building blocks required to implement the 
tele-diagnosis and screening use cases were de- 
ployed. As soon as the implementation of the 
data selection from the ROOT shell through the 
AliEn C++ API will be available, GPCALMA 
nodes will be installed in the participating hos- 
pitals and connected to the AliEn Server, hosted 
by INFN. Hopefully, that task will be completed 
by the end of 2004. 
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